Abstract

Statistical spoken dialogue systems have the attractive property of
being able to be optimised from data via interactions with real users.
However, in the reinforcement learning paradigm the dialogue manager
(agent) often requires significant time to explore the state-action
space to learn to behave in a desirable manner. This is a critical
issue when the system is trained on-line with real users, where
learning is expensive. Reward shaping is one promising technique for
addressing these concerns. Here we examine three recurrent neural
network (RNN) approaches for providing reward shaping information in
addition to the primary (task-orientated) environmental feedback.
These RNNs are trained on returns from dialogues generated by a
simulated user and attempt to diffuse the overall evaluation of the
dialogue back down to the turn level to guide the agent towards good
behaviour faster.

In both simulated and real user scenarios 
these RNNs are shown to increase policy 
learning speed. Importantly, they do not 
require prior knowledge of the user’s goal. 

1 Introduction 

Spoken dialogue systems (SDS) offer a natural way for people to
interact with computers. With the ability to learn from data
(interactions), statistical SDS can theoretically be created faster
and with fewer man-hours than a comparable handcrafted rule based
system. They have also been shown to perform better (Young et al.,
2013). Central to this is the use of partially observable Markov
decision processes (POMDP) to model dialogue, which inherently manage
the uncertainty created by errors in speech recognition and semantic
decoding (Williams and Young, 2007).


The dialogue manager is a core component of an SDS and largely
determines the quality of interaction. Its behaviour is controlled by
a policy which maps belief states to system actions (or distributions
over sets of actions). This policy is trained in a reinforcement
learning framework (Sutton and Barto, 1999) where rewards are received
from the environment, the most informative of which occurs only at the
dialogue’s conclusion, indicating task success or failure.^1

It is the sparseness of this environmental reward function which, by
not providing any information at intermediate turns, forces
exploration to traverse many sub-optimal paths deeply. This is a
significant concern when training SDS on-line with real users, where
one wishes to minimise client exposure to sub-optimal system
behaviour. In an effort to counter this problem, reward shaping (Ng et
al., 1999) introduces domain knowledge to provide earlier informative
feedback to the agent (additional to the environmental feedback) for
the purpose of biasing exploration so that optimal behaviour is
discovered more quickly.^2 Reward shaping is briefly reviewed in
Section 2.1.

In the context of SDS, Ferreira and Lefèvre (2015) have motivated the
use of reward shaping via analogy to the ‘social signals’ naturally
produced and interpreted throughout a human-human dialogue. This
non-statistical reward shaping model used heuristic features for
speeding up policy learning.

As an alternative, one may consider attempting to handcraft a finer
grained environmental reward function. For example, Asri et al. (2014)
proposed diffusing expert ratings of dialogues to the state transition
level to produce a richer reward function. Policy convergence may
occur faster in this altered POMDP, and dialogues generated by a task
based simulated user may also alleviate the need for expert ratings.
However, unlike reward shaping, modifying the environmental reward
function also modifies the resulting optimal policy.

^1 A uniform reward of -1 is common for all other, non-terminal turns,
which promotes faster task completion.

^2 Learning algorithms are another central element in improving the
speed of convergence during policy training. In particular, the
sample-efficiency of the learning algorithm can be the deciding factor
in whether it can realistically be employed on-line. See e.g. the
GP-SARSA (Gašić and Young, 2014) and Kalman temporal-difference
(Daubigney et al., 2014) methods, which bootstrap estimates of sparse
value functions from minimal numbers of samples (dialogues).

We recently proposed convolutional and recurrent neural network (RNN)
approaches for determining dialogue success. This was used to provide
a reinforcement signal for learning on-line from real users without
requiring any prior knowledge of the user’s task (Su et al., 2015).
Here we extend the RNN approach by introducing new training
constraints in order to combine the merits of the above three works:
(1) diffusing dialogue level ratings down to the turn level to (2) add
reward shaping information for faster policy learning, whilst (3) not
requiring prior task knowledge, which is simply unavailable on-line.

In Section 2 we briefly describe potential based reward shaping before
introducing the RNNs we explore for producing reward shaping signals
(basic RNN, long short-term memory (LSTM) and gated recurrent unit
(GRU)). The features the RNNs use, along with the training constraint
and loss, are also described. The experimental evaluation is then
presented in Section 3. Firstly, the estimation accuracy of the RNNs
is assessed. The benefit of using the RNN for reward shaping in both
simulated and real user scenarios is then demonstrated. Finally,
conclusions are presented in Section 4.

2 RNNs for Reward Shaping 

2.1 Reward Shaping 

Figure 1: RNN with three types of hidden units: basic, LSTM and GRU.
The feature vectors extracted at turns $t = 1, \ldots, T$ are labelled
$f_t$.

Reward shaping provides the system with an extra reward signal $F$ in
addition to the environmental reward $R$, making the system learn from
the composite signal $R + F$. The shaping reward $F$ often encodes
expert knowledge that complements the sparse signal $R$. Since the
reward function defines the system’s objective, changing it may result
in a different task. When the task is modelled as a fully observable
Markov decision process (MDP), Ng et al. (1999) defined formal
requirements on the shaping reward as a difference of any potential
function $\phi$ on consecutive states $s$ and $s'$ which preserves the
optimality of policies. Based on this property, Eck et al. (2015)
further extended it to POMDPs by proof and empirical experiments:


$$F(b_t, a, b_{t+1}) = \gamma \phi(b_{t+1}) - \phi(b_t) \qquad (1)$$

where $\gamma$ is the discount factor, $b_t$ the belief state at turn
$t$, and $a$ the action leading $b_t$ to $b_{t+1}$.
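As an illustration only, the shaping term of equation (1) can be sketched in a few lines of Python. The function names and the potential values in the example are hypothetical and are not taken from the system described here; the potentials would in practice come from whatever function plays the role of the potential.

```python
def shaping_reward(phi_b, phi_b_next, gamma=1.0):
    """Potential-based shaping term: gamma * phi(b_{t+1}) - phi(b_t)."""
    return gamma * phi_b_next - phi_b


def shaped_rewards(env_rewards, potentials, gamma=1.0):
    """Add the shaping term to each environmental reward.

    env_rewards[t] is the reward for the transition b_t -> b_{t+1};
    potentials holds the potential of every visited belief state, so it
    has exactly one more entry than env_rewards.
    """
    if len(potentials) != len(env_rewards) + 1:
        raise ValueError("need one potential per visited belief state")
    return [
        r + shaping_reward(potentials[t], potentials[t + 1], gamma)
        for t, r in enumerate(env_rewards)
    ]


# Hypothetical 3-turn dialogue: -1 per turn, plus 20 for success at the end.
rewards = [-1.0, -1.0, 19.0]
phis = [0.0, 2.0, 5.0, 0.0]  # illustrative potential values only
print(shaped_rewards(rewards, phis))  # [1.0, 2.0, 14.0]
```

With a discount factor of 1 and equal potentials at the first and last belief state, the shaping terms telescope to zero, so the total return of the dialogue is unchanged while the intermediate turns receive earlier feedback.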

However, determining an appropriate potential function for an SDS is
non-trivial. Rather than hand-crafting the function with heuristic
knowledge, we propose using an RNN to predict suitable potential
values, as described in the following.


2.2 Recurrent Neural Network Models 


The RNN model is a subclass of neural network defined by the presence
of feedback connections. The ability to succinctly retain history
information makes it suitable for modelling sequential data. It has
been successfully used for language modelling (Mikolov et al., 2011)
and spoken language understanding (Mesnil et al., 2015).

However, Bengio et al. (1994) observed that basic RNNs suffer from
vanishing/exploding gradient problems that limit their capability of
modelling long context dependencies. To address this, long short-term
memory (Hochreiter and Schmidhuber, 1997) and gated recurrent unit
(Chung et al., 2014) RNNs have been proposed. In this paper, all three
types of RNN (basic/LSTM/GRU) are compared.


2.3 Reward Shaping with RNN Prediction 

The role of the RNN is to solve the regression problem of predicting
the scalar return of each completed dialogue. At every turn $t$, input
features $f_t$ are extracted from the belief/action pair and used to
update the hidden layer $h_t$. From dialogues generated by a simulated
user (Schatzmann and Young, 2009), supervised training pairs are
created which consist of the turn level sequence of these feature
vectors $f_t$ along with the scalar dialogue return as scored by an
objective measure of task completion. Whilst the RNN models are
trained on dialogue level supervised targets, we hypothesise that
their subsequent turn level predictions can guide policy exploration
by acting as informative reward shaping potentials.

To encourage good turn level predictions, all three RNN variants are
trained to predict the dialogue return not with the final output of
the network, but with the constraint that their scalar outputs from
each turn $t$ should sum to predict the return for the whole dialogue.
This is shown in Figure 1. A mean-square-error (MSE) loss is used (see
Appendix A). The trained RNNs are then used directly as the reward
shaping potential function $\phi$, using the RNN scalar output at each
turn.
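To make the per-turn output constraint concrete, the following is a minimal sketch of a basic RNN forward pass that emits one scalar per turn and sums them into a dialogue-level prediction. It is not the implementation used in the experiments: the class name `TinyRNN`, the linear output layer, the weight initialisation and the toy dimensions are all assumptions made for illustration, and training is omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyRNN:
    """Basic RNN with a sigmoid hidden layer and one scalar output per turn."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.W_in = mat(n_hidden, n_in)       # input -> hidden weights
        self.W_rec = mat(n_hidden, n_hidden)  # hidden -> hidden feedback
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
        self.n_hidden = n_hidden

    def turn_outputs(self, features):
        """Return the scalar output r_t for every turn of one dialogue."""
        h = [0.0] * self.n_hidden
        outputs = []
        for f_t in features:
            # New hidden state from the current features and the previous state.
            h = [
                sigmoid(
                    sum(w * x for w, x in zip(self.W_in[j], f_t))
                    + sum(w * hp for w, hp in zip(self.W_rec[j], h))
                )
                for j in range(self.n_hidden)
            ]
            # Linear scalar output per turn (an assumption of this sketch).
            outputs.append(sum(w * hj for w, hj in zip(self.w_out, h)))
        return outputs


rnn = TinyRNN(n_in=4, n_hidden=3)
dialogue = [[1.0, 0.0, 0.0, 0.1], [0.0, 1.0, 0.0, 0.2], [0.0, 0.0, 1.0, 0.3]]
r = rnn.turn_outputs(dialogue)
predicted_return = sum(r)  # dialogue-level prediction is the sum over turns
```

During training only `predicted_return` would be compared with the dialogue return, whereas at policy learning time the individual entries of `r` would serve as the turn level potentials.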

The feature inputs $f_t$ for all RNNs consisted of the following
sections: the real-valued belief state vector formed by concatenating
the distributions over user discourse act, method and goal variables
(Thomson and Young, 2010), one-hot encodings of the user and summary
system actions, and the normalised turn number. This feature vector
was extracted at every turn (system + user exchange).
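The layout of such a feature vector can be sketched as below. This is a toy illustration: the helper names, the belief values and the number of user acts are hypothetical, while the limit of 30 turns and the 20 summary actions follow the experimental setup reported in this paper.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given position."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_features(belief, user_act, n_user_acts, sys_act, n_sys_acts,
                   turn, max_turns=30):
    """Concatenate the feature sections for one turn."""
    return (
        list(belief)                       # real-valued belief state section
        + one_hot(user_act, n_user_acts)   # most likely user act
        + one_hot(sys_act, n_sys_acts)     # summary system action
        + [turn / max_turns]               # normalised turn number
    )


# Hypothetical turn: 5 belief values, 4 user act types, 20 summary actions.
features = build_features(
    belief=[0.7, 0.2, 0.1, 0.9, 0.1],
    user_act=2, n_user_acts=4,
    sys_act=5, n_sys_acts=20,
    turn=3,
)
print(len(features))  # 30
```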


3 Experiments 


3.1 Experimental Setup 

In all experiments the Cambridge restaurant domain was used, which
consists of approximately 150 venues each having 6 attributes (slots),
of which 3 can be used by the system to constrain the search and the
remaining 3 are informable properties once a database entity has been
found.

The shared core components of the SDS in all experiments were a domain
independent ASR, a confusion network (CNet) semantic input decoder
(Henderson et al., 2012), the BUDS (Thomson and Young, 2010) belief
state tracker that factorises the dialogue state using a dynamic
Bayesian network, and a template based natural language generator. All
policies were trained by GP-SARSA (Gašić and Young, 2014) and the
summary action space contains 20 actions. The per-turn reward was set
to -1 and the final reward to 20 for task success and 0 otherwise.
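For concreteness, the return implied by this reward scheme can be computed as in the sketch below. It assumes an undiscounted return and that the per-turn reward is applied at every turn, including the last; neither assumption is stated explicitly above, and the function name is hypothetical.

```python
def dialogue_return(n_turns, success, turn_reward=-1, success_reward=20):
    """Undiscounted return: per-turn penalty plus the final task reward."""
    if n_turns < 0:
        raise ValueError("n_turns must be non-negative")
    return n_turns * turn_reward + (success_reward if success else 0)


print(dialogue_return(6, True))    # 14
print(dialogue_return(30, False))  # -30
```

Under these assumptions a short successful dialogue scores close to 20, while a failed dialogue that runs to the turn limit scores -30, which is the range the RNN regression targets would span.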

With this ontology, the size of the full feature vector was 147. The
turn number was expressed as a percentage of the maximum number of
allowed turns, here 30. The one-hot user dialogue act encoding was
formed by taking only the most likely user act estimated by the CNet
decoder.


Figure 2: RMSE of return prediction by using RNN/LSTM/GRU, trained on
18K and 1K dialogues and tested on sets testA and testB (see text).

3.2 Neural Network Training 

Here results of training the 3 RNNs on the simulated user dialogues
are presented.^3 Two training sets were used, consisting of 18K and 1K
dialogues, to verify the model robustness. In all cases a separate
validation set consisting of 1K dialogues was used for controlling
overfitting. Training and validation sets were approximately balanced
regarding objective success/failure labels and collected at a 15%
semantic error rate (SER). Prediction results are shown in Figure 2 on
two test sets: testA, containing 1K dialogues, balanced regarding
objective labels, at 15% SER; and testB, containing 12K dialogues
collected at SERs of 0, 15, 30 and 45% as the data occurred (i.e. with
no balancing regarding labels).

Root-MSE (RMSE) results of predicting the dialogue return are depicted
in Figure 2. The models with LSTM and GRU units achieved a slight
improvement in most cases over the basic RNN. Notice that the model
with GRU even reached comparable results when trained with 1K training
dialogues compared to 18K. The results from the 1K training set
indicate that the model can be developed from limited data. This
enables datasets to be created by human annotation, avoiding the need
for a simulated user. The results on set testB also show that the
models can perform well in situations with varying error rates, as
would be encountered in real operating environments. We next examine
the RNN-based reward shaping for policy training with a simulated
user.
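The RMSE measure used in this comparison can be written as the short sketch below, where each prediction is taken to be the dialogue-level return estimate for one test dialogue. The function name and the example values are hypothetical.

```python
import math


def rmse(predicted, target):
    """Root-mean-square error between predicted and true dialogue returns."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))


# Two hypothetical test dialogues: one prediction is off by 2, one is exact.
print(rmse([14.0, -10.0], [12.0, -10.0]))  # sqrt(2), about 1.414
```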

^3 All RNNs were implemented using the Theano library (Bergstra et
al., 2010). In all cases the hidden layer contained 100 units with a
sigmoid non-linearity, and stochastic gradient descent (per dialogue)
was used for training.
Figure 3: Policy training via simulated user with (GRU/HDC) and
without (baseline) reward shaping. Standard errors are also shown.


Figure 4: Learning curves of reward with standard 
errors during on-line policy optimisation for the 
baseline (black) and proposed (green) systems. 


3.3 Policy Learning with Simulated User 

Since the aim of reward shaping is to enhance policy learning speed,
we focus on the first 1000 training dialogues. Figure 2 shows that the
GRU RNN attained slightly better performance than the other two RNN
models, albeit with no statistical significance. Thus, for clearer
presentation of the policy training results, we plot only the GRU
results, using the model trained on 18K dialogues.

To show the effectiveness of using an RNN with GRU units for
predicting reward shaping potentials, we compare it with the
hand-crafted (HDC) method for reward shaping proposed by Ferreira and
Lefèvre (2013), which requires prior knowledge of the user’s task, and
with a baseline policy using only the environmental reward. Figure 3
shows the learning curve of the reward for the three systems. After
every 50 training iterations each system was tested with 1000
dialogues, and the results were averaged over 10 policies. The
simulated user’s SER was set to 15%.

We see that reward shaping indeed provides the agent with more
information, increasing the learning speed. Furthermore, our proposed
RNN method outperforms the hand-crafted system, whilst also being
applicable on-line.

3.4 Policy Learning with Human Users 

Based on the above results, the same GRU model was selected for
training a policy on-line with humans. Two systems were trained with
users recruited via Amazon Mechanical Turk: a baseline was trained
with only the environmental reward, and another system was trained
with an additional shaping reward predicted by the proposed GRU.
Learning began from a random policy in all cases.


Figure 4 shows the on-line learning curve of the reward when training
the systems with 400 dialogues. The moving average was calculated
using a window of 100 dialogues and each result was averaged over
three policies in order to reduce noise. It can be seen that by adding
the RNN based shaping reward, the policy learnt more quickly in the
important initial phase of policy learning.
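The smoothing described here can be sketched as follows. The sketch assumes that curves are first averaged across policies and then smoothed using only full windows; the order of these two steps and the treatment of the first incomplete window are not specified above, and the function names are hypothetical.

```python
def average_curves(curves):
    """Element-wise mean of several equally long reward curves."""
    if not curves or len({len(c) for c in curves}) != 1:
        raise ValueError("curves must be non-empty and of equal length")
    return [sum(vals) / len(vals) for vals in zip(*curves)]


def moving_average(values, window=100):
    """Mean of every full window of consecutive values."""
    if window <= 0:
        raise ValueError("window must be positive")
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        if i >= window - 1:
            out.append(running / window)
    return out


# Toy example with two policies and a window of 3 instead of 100.
curves = [[1, 2, 3, 4, 5], [3, 4, 5, 6, 7]]
print(moving_average(average_curves(curves), window=3))  # [3.0, 4.0, 5.0]
```

With 400 dialogues and a window of 100, this convention would yield 301 points per learning curve.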

4 Conclusions 


This paper has shown that RNN models can be trained to predict the
dialogue return with a constraint such that subsequent turn level
predictions act as good reward shaping signals that are effective for
accelerating policy learning on-line with real users. As in many other
applications, we found that gated RNNs such as LSTM and GRU perform a
little better than basic RNNs.

In the work described here, the RNNs were trained using a simulated
user, and this simulator could have been used to bootstrap a policy
for use with real users. However, our supposition is that RNNs could
be trained for reward prediction which are substantially domain
independent and hence have wider applications via domain adaptation
and extension (Gašić et al., 2015; Brys et al., 2015). Testing this
supposition will be the subject of future work.
A Training Constraint/Loss Function

For all RNN models the following MSE loss function is used on a
per-dialogue basis:

$$\mathrm{MSE} = \left( R - \sum_{t=1}^{T} r_t \right)^2 \qquad (2)$$

where the current dialogue has $T$ turns, $R$ is the return and
training target, and $r_t$ is the scalar prediction output by the RNN
model at each turn.
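Reading the loss as the squared difference between the return $R$ and the sum of the per-turn outputs $r_t$, which is what the definitions above and the training constraint of Section 2.3 suggest, a per-dialogue loss and its gradient with respect to each output can be sketched as follows. The function names are hypothetical, and backpropagation through the network itself is omitted.

```python
def sum_constraint_loss(turn_outputs, dialogue_return):
    """Squared error between the return and the sum of per-turn outputs."""
    error = dialogue_return - sum(turn_outputs)
    return error ** 2


def loss_gradient(turn_outputs, dialogue_return):
    """Gradient of the loss with respect to each r_t.

    Because the outputs only enter through their sum, every turn
    receives the same gradient, -2 * (R - sum of r_t).
    """
    error = dialogue_return - sum(turn_outputs)
    return [-2.0 * error] * len(turn_outputs)


# Hypothetical 3-turn dialogue with return 14 and outputs summing to 12.
outputs = [3.0, 4.0, 5.0]
print(sum_constraint_loss(outputs, 14.0))  # 4.0
print(loss_gradient(outputs, 14.0))        # [-4.0, -4.0, -4.0]
```

The identical gradient at every turn means the loss itself does not say how the return should be divided among turns; that division is left to what the recurrent model learns from the turn level features.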



